Do not cast .transform() output back to input dtype (closes #10972) #15256

nbonnotte · 2017-01-29T13:45:47Z

closes BUG: output of a transform is cast to dtype of input #10972
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Maybe need more tests, and check performance?

jreback

this needs a perf check (just run the asv benchmarks for groupby and see)

nbonnotte · 2017-01-29T16:34:45Z

Well, each time I run asv, I get different results, but after closing everything and leaving my computer alone, here is what I get:

   before     after       ratio
  [3853fe6d] [2bc47491]
+   19.84ms    22.47ms      1.13  groupby.GroupBySuite.time_diff('float', 100)
+   12.45ms    13.92ms      1.12  groupby.FirstLast.time_groupby_nth_none('float64')
-    1.97ms     1.78ms      0.90  groupby.GroupBySuite.time_size('int', 10000)
-   27.05ms    24.28ms      0.90  groupby.GroupBySuite.time_rank('float', 100)
-   19.79ms    17.21ms      0.87  groupby.FirstLast.time_groupby_nth_none('object')
-   18.12ms    15.64ms      0.86  groupby.FirstLast.time_groupby_nth_none('datetime')
-   10.96ms     9.37ms      0.86  groupby.FirstLast.time_groupby_last('datetime')

jreback · 2017-01-29T16:37:18Z

those are not the transform asv benchmarks

nbonnotte · 2017-01-29T17:11:16Z

I ran asv continuous -f 1.1 -E virtualenv upstream/master HEAD -b ^groupby, but it's the first time I've done perf testing, so I may have missed something :)

Is there something else I should do for this PR?

nbonnotte · 2017-01-29T17:31:27Z

I think Appveyor is complaining about int being int32 on Windows instead of int64. Is there a way to cast them to int64 instead?
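For readers unfamiliar with the platform difference mentioned here, the following is a minimal sketch and not code from this PR. It assumes NumPy's historical behaviour, where the default integer type follows the platform (32-bit on Windows for NumPy releases before 2.0), and shows that passing an explicit dtype removes the ambiguity. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The default integer dtype is platform-dependent: historically int32 on
# Windows (NumPy < 2.0) and int64 on 64-bit Linux/macOS.
default_ints = np.array([1, 2, 3])
print(default_ints.dtype)

# Requesting int64 explicitly gives the same dtype on every platform.
explicit_ints = np.array([1, 2, 3], dtype=np.int64)
explicit_series = pd.Series([1, 2, 3], dtype="int64")

# An existing array can also be converted after the fact.
converted = default_ints.astype(np.int64)
```

Whether pinning the dtype is appropriate in a test is a separate question from how to do it, which is what the review reply that follows addresses.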

jreback

don't change any tests; you have to maintain the existing ones unless there is a really good reason

tests breaking generally means something is wrong in your code

i suspect your changes are not propagating dtypes correctly

add the tests from the issue

jreback · 2017-01-29T18:23:56Z

pandas/tseries/tests/test_resample.py

@@ -1805,7 +1805,7 @@ def test_resample_median_bug_1688(self):
                                          datetime(2012, 1, 1, 0, 5, 0)],
                           dtype=dtype)

-            result = df.resample("T").apply(lambda x: x.mean())
+            result = df.resample("T").mean()
            exp = df.asfreq('T')


why are you changing tests like this?

jreback · 2017-01-29T18:24:21Z

pandas/tests/groupby/test_groupby.py

@@ -5669,7 +5668,7 @@ def test_ops_general(self):
        labels = np.random.randint(0, 50, size=1000).astype(float)

        for op, targop in ops:
-            result = getattr(df.groupby(labels), op)().astype(float)
+            result = getattr(df.groupby(labels), op)()


don't change tests

your changes are not robust

codecov-io · 2017-01-29T20:12:45Z

Codecov Report

Merging #15256 into master will not impact coverage.

@@           Coverage Diff           @@
##           master   #15256   +/-   ##
=======================================
  Coverage   86.33%   86.33%           
=======================================
  Files         139      139           
  Lines       51149    51149           
=======================================
  Hits        44157    44157           
  Misses       6992     6992

Impacted Files	Coverage Δ
pandas/core/groupby.py	`95.15% <100%> (ø)`	✅

Powered by Codecov. Last update c26e5bb...ec52ab4. Read the comment docs.

nbonnotte · 2017-01-29T20:17:44Z

@jreback Hum, this is more complex than I thought

The initial problem is: .transform casts output dtypes back to the input dtype, so .size, for instance, produces timestamps if the initial data was timestamps.

If I keep the output type as it is, then Series.mean, for instance, will produce float. So if the original data is float32, performing .groupby().apply(lambda g: g.mean()) might thus result in float64 output, but .groupby().mean() will stay in float32.

Am I missing something obvious? How to proceed from there?
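The situation described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the test from the issue: the column names `key`, `when`, and `x` are made up, and the integer result for the size case assumes a pandas release that includes the eventual fix (the thread credits commit 251826f, milestone 0.20.0). The float32 dtypes are only printed, since they depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "when": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
        "x": pd.Series([1.0, 2.0, 4.0], dtype="float32"),
    }
)

# Broadcast each group's size back onto the rows of a datetime column.
# Before the fix these counts were cast back to timestamps.
sizes = df.groupby("key")["when"].transform(lambda s: len(s))
print(sizes.tolist(), sizes.dtype)

# The float32 dilemma: compare the dtype of the built-in aggregation
# with the dtype obtained by applying Series.mean per group.
builtin_mean = df.groupby("key")["x"].mean()
applied_mean = df.groupby("key")["x"].apply(lambda g: g.mean())
print(builtin_mean.dtype, applied_mean.dtype)
```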

jreback · 2017-01-29T20:19:31Z

you need to selectively cast

almost every function should be cast, except for things that produce ints (like size)

nbonnotte · 2017-01-29T20:27:22Z

So .groupby().agg(lambda x: float(len(x))) should keep producing timestamps if the input is timestamps?

(That's how I stumbled on this issue when working on my PR #12607)

jreback · 2017-01-29T20:35:44Z

@nbonnotte of course not, that's the point of this issue!

you simply need to infer based on the return values, not automatically cast it.

you might be able to simply stick it into a Series and it will work, BUT, you will still need to cast if it's a numeric type (IOW a float) to the original type flavor.

my point above is that you need to really look at the tests and create code that DOESN'T change anything except for specific tests (e.g. the test from the actual issue).
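One way to read the advice above is sketched below. This is not the code that landed in pandas; `selectively_cast` is a hypothetical helper invented for illustration. It leaves the inferred dtype alone unless both the result and the original input are floats, in which case it restores the input's float width.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def selectively_cast(result: pd.Series, input_dtype) -> pd.Series:
    """Hypothetical helper: cast back only within the float family."""
    if is_float_dtype(result.dtype) and is_float_dtype(input_dtype):
        # e.g. a float64 mean computed from float32 input goes back to float32
        return result.astype(input_dtype)
    # Anything else (ints from size, objects, ...) keeps its inferred dtype.
    return result


# A float result computed from float32 input is narrowed back to float32.
means = selectively_cast(pd.Series([1.5, 2.5], dtype="float64"), "float32")

# An integer result computed from datetime input is left untouched.
sizes = selectively_cast(pd.Series([2, 1], dtype="int64"), "datetime64[ns]")

print(means.dtype, sizes.dtype)
```

The design choice here is to key the decision on the pair (result dtype, input dtype) rather than on the name of the function that was applied, so user-defined lambdas are handled the same way as built-in reductions.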

nbonnotte · 2017-01-30T20:05:42Z

Ok. I'm going to take it slow, starting with datetime input

jreback · 2017-03-02T23:19:30Z

this was fixed by 251826f

nbonnotte · 2017-03-03T07:56:34Z

It looks so simple once you see the solution... Thanks @jreback for your feedback and your patience, sorry I couldn't get that through

jreback reviewed Jan 29, 2017

jreback requested changes Jan 29, 2017

jreback added the Bug, Dtype Conversions (Unexpected or buggy dtype conversions), and Groupby labels Jan 29, 2017

Do not cast .transform() output back to input dtype: datetime input

ec52ab4

jreback closed this Mar 2, 2017

jreback added this to the 0.20.0 milestone Mar 2, 2017

nbonnotte deleted the 10972-wrong-cast branch March 3, 2017 07:56
